Abstract

We propose an off-line approach to explicitly encode temporal patterns
spatially as different types of images, namely, Gramian Angular Fields (GAF)
and Markov Transition Fields (MTF). This enables the use of techniques from
computer vision for feature learning and classification. We used Tiled
Convolutional Neural Networks to learn high-level features from individual GAF,
MTF, and GAF-MTF images on 12 benchmark time series datasets and two real
spatial-temporal trajectory datasets. The classification results of our
approach are competitive with state-of-the-art approaches on both types of
data. An analysis of the features and weights learned by the CNNs explains why
the approach works.

Keywords: Time-series, Trajectory, Classification, Gramian Angular Field, 
Markov Transition Field, Convolutional Neural Networks 


1. Introduction 

The problem of temporal data classification has attracted great interest
recently, finding applications in domains as diverse as medicine, finance,
entertainment, and industry. However, learning the complicated temporal
correlations in complex dynamic systems is still a challenging problem.
Inspired by recent successes of deep learning in computer vision, we consider
the problem of encoding temporal information spatially as images to allow
machines to “visually” recognize and classify temporal data, especially time
series data.

Recognition tasks in speech and audio have been well studied. Researchers
have achieved success using combinations of HMMs with acoustic models based
on Gaussian Mixture Models (GMMs) [1, 2]. An alternative approach is to use
deep neural networks to produce posterior probabilities over HMM states. Deep
learning has become increasingly popular since the introduction of effective
ways to train multiple hidden layers [3] and has been proposed as a replacement
for GMMs to model acoustic data in speech recognition tasks [4]. These Deep
Neural Network - Hidden Markov Model hybrid systems (DNN-HMM) achieved
remarkable performance in a variety of speech recognition tasks [5, 6, 7]. Such
success stems from learning distributed representations via a deeply layered
structure and unsupervised pretraining by stacking single-layer Restricted
Boltzmann Machines (RBMs).

Another deep learning architecture used in computer vision is convolutional
neural networks (CNNs) [8]. CNNs exploit translational invariance within their
structures by extracting features through receptive fields [9] and learn with
weight sharing. CNNs are the state-of-the-art approach in various image
recognition and computer vision tasks [10, 11, 12]. Since unsupervised
pretraining has been shown to improve performance [13], sparse coding and
Topographic Independent Component Analysis (TICA) are integrated as
unsupervised pretraining approaches to learn more diverse features with complex
invariances [14, 15].

CNNs were proposed for speech processing because of their invariance to
shifts in time and frequency [16]. Recently, CNNs have been shown to further
improve hybrid model performance by applying convolution and max-pooling in
the frequency domain on the TIMIT phone recognition task [17]. A heterogeneous
pooling approach proved to be beneficial for training acoustic invariance
[18]. Further exploration with limited weight sharing and a weighted softmax
pooling layer has been proposed to optimize CNN structures for speech
recognition tasks [19].

However, except for audio and speech data, relatively little work has
explored feature learning in the context of typical time series analysis tasks
with current deep learning architectures. [20] explores supervised feature
learning with CNNs to classify multi-channel time series with two datasets.
They extracted subsequences with sliding windows and compared their results to
Dynamic Time Warping (DTW) with a 1-Nearest-Neighbor classifier (1NN-DTW).
Our motivation is to explore a novel framework to encode time series as images
and thus to take advantage of the success of deep learning architectures in
computer vision to learn features and identify structure in time series. Unlike
speech recognition systems, in which acoustic/speech data input is typically
represented by concatenating Mel-frequency cepstral coefficients (MFCCs) or
perceptual linear predictive coefficients (PLPs) [21], typical time series data
are not likely to benefit from transformations applied to speech or acoustic
data.

In this work, we propose two types of representations for explicitly encoding
the temporal patterns in time series as images. We test our approach on twelve
time series datasets produced from 2D shape, physiological surveillance,
industry and other domains. Two real spatial-temporal trajectory datasets are
also considered for experiments to demonstrate the performance of our approach.
We applied deep Convolutional Neural Networks with a pretraining stage that
exploits local orthogonality by Topographic ICA [15] to “visually” inspect and
classify time series. We report our classification performance both on GAF and
MTF separately, and on GAF-MTF, which results from combining the GAF and MTF
representations into a single image. By comparing our results with the current
best hand-crafted representation and classification methods on both time series
and trajectory data, we show that our approach in practice achieves competitive
performance with the state of the art with only cursory exploration of
hyperparameters. In addition to exploring the high-level features learned by
Tiled CNNs, we provide an in-depth analysis in terms of the duality between
time series and images. This helps us to more precisely identify the reasons
why our approaches work well.

2. Motivation


Learning the (long) temporal correlations that are often embedded in time
series remains a major challenge in time series analysis and modeling. Most
real-world data has a temporal component, whether it is measurements of natural
(weather, sound) or man-made (stock market, robotics) phenomena. Traditional
approaches for modeling and representing time-series data fall into three
categories. In time series learning problems, non-data-adaptive models, such
as the Discrete Fourier Transformation (DFT) [23], Discrete Wavelet
Transformation (DWT) [24], and Discrete Cosine Transformation (DCT) [25],
compute the transformation with an algorithm that is invariant with respect to
the data, capturing the intrinsic temporal correlation with different basis
functions. Meanwhile, researchers have explored model-based approaches to
modeling time series, such as Auto-Regressive Moving Average models (ARMA) [26]
and Hidden Markov Models (HMMs) [27], in which the underlying data is assumed
to fit a specific type of model that explicitly describes the temporal
patterns. The estimated parameters can then be used as features for
classification or regression. However, more complex, high-dimensional, and
noisy real-world time-series data are often difficult to model because the
dynamics are either too complex or unknown. Traditional methods, which contain
a small number of non-linear operations, might not have the capacity to
accurately model such complex systems.

If implicitly learning the complex temporal correlation is difficult, how about
reformulating the data to explicitly or even visually encode the temporal
dependency, allowing the algorithms to learn more easily? Reformulating the
features of time series as visual clues has attracted much attention in
computer science and physics. A typical example comes from speech recognition,
where acoustic/speech data input is represented by MFCCs or PLPs to explicitly
represent the temporal and frequency information. Recently, researchers have
been building different network structures from time series for visual
inspection or for designing distance measures. Recurrence Networks were
proposed to analyze the structural properties of time series from complex
systems [28, 29]. They build adjacency matrices from predefined recurrence
functions to interpret the time series as complex networks. Silva et al.
extended the recurrence plot paradigm for time series classification using
compression distance [30]. Another way to build a weighted adjacency matrix is
extracting transition dynamics from the first-order Markov matrix [31].
Although these maps demonstrate distinct topological properties among different
time series, it remains unclear how these topological properties relate to the
original time series since they have no exact inverse operations. One of our
contributions is to propose a set of off-line algorithms to encode the complex
correlations in time series into images for visual inspection and
classification. The proposed encoding functions have exact/approximate inverse
maps, making such transformations more interpretable.

3. Encoding Methods 

We first introduce our two frameworks to encode time series data as images.
The first type of image is the Gramian Angular Field (GAF), in which we
represent time series in a polar coordinate system instead of the typical
Cartesian coordinates. In the Gramian matrix, each element is the cosine
of the summation of pairwise temporal values. Inspired by previous work on
the duality between time series and complex networks [31], the main idea of the
second framework, the Markov Transition Field (MTF), is to build the Markov
matrix of quantile bins after discretization and encode the dynamic transition
probability in a quasi-Gramian matrix.

3.1. Gramian Angular Field 

Given a time series $X = \{x_1, x_2, \ldots, x_n\}$ of $n$ real-valued
observations, we rescale $X$ so that all values fall in the interval $[-1, 1]$
or $[0, 1]$ by:

\[ \tilde{x}_{-1}^{i} = \frac{(x_i - \max(X)) + (x_i - \min(X))}{\max(X) - \min(X)} \tag{1} \]

or

\[ \tilde{x}_{0}^{i} = \frac{x_i - \min(X)}{\max(X) - \min(X)} \tag{2} \]

Thus we can represent the rescaled time series $\tilde{X}$ in polar coordinates
by encoding the value as the angular cosine and the time stamp as the radius
with the equations below:

\[ \begin{cases} \phi = \arccos(\tilde{x}_i), & -1 \le \tilde{x}_i \le 1,\ \tilde{x}_i \in \tilde{X} \\ r = \dfrac{t_i}{N}, & t_i \in \mathbb{N} \end{cases} \tag{3} \]

Here $t_i$ is the time stamp and $N$ is a constant factor to regularize the
span of the polar coordinate system. This polar coordinate based representation
is a novel way to understand time series. As time increases, corresponding
values warp among different angular points on the spanning circles, like water
rippling. The encoding map of Eq. (3) has two important properties. First, it
is bijective as $\cos(\phi)$ is monotonic when $\phi \in [0, \pi]$. Given a
time series, the proposed map produces one and only one result in the polar
coordinate system with a unique inverse function. Second, as opposed to
Cartesian coordinates, polar coordinates preserve absolute temporal relations.
In Cartesian coordinates, where the area is defined by
$S_{i,j} = \int_{x(i)}^{x(j)} f(x(t))\,dx(t)$, we have $S_{i,i+k} = S_{j,j+k}$
if $f(x(t))$ has the same values on $[i, i+k]$ and $[j, j+k]$. However, in
polar coordinates, if the area is defined as
$S'_{i,j} = \int_{\phi(i)}^{\phi(j)} r[\phi(t)]^2\,d(\phi(t))$, then
$S'_{i,i+k} \ne S'_{j,j+k}$. That is, the corresponding area from time stamp
$i$ to time stamp $j$ is not only dependent on the time interval $|i - j|$, but
also determined by the absolute values of $i$ and $j$.
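The rescaling and polar encoding of Eqs. (1) and (3) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, time stamps are assumed to be 1, 2, ..., n, and the constant N is assumed to default to the series length.

```python
import math

def rescale(series, low=-1.0, high=1.0):
    """Min-max rescale into [low, high]; [-1, 1] matches Eq. (1), [0, 1] matches Eq. (2)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled")
    return [low + (high - low) * (x - lo) / (hi - lo) for x in series]

def to_polar(scaled, n_span=None):
    """Eq. (3): angle phi = arccos(value), radius r = t / N."""
    n_span = n_span or len(scaled)  # assumption: N defaults to the series length
    # Clamp to [-1, 1] so floating-point overshoot cannot break acos.
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    r = [(t + 1) / n_span for t in range(len(scaled))]  # assumption: t = 1..n
    return phi, r

def from_polar(phi):
    """Inverse map: the rescaled value is the cosine of the angle."""
    return [math.cos(p) for p in phi]

series = [2.0, 4.0, 6.0, 3.0]
scaled = rescale(series)
phi, r = to_polar(scaled)
recovered = from_polar(phi)
```

Because arccos is restricted to [0, π], `from_polar` recovers the rescaled values up to floating-point error, which is the bijectivity property stated above.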

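After transforming the rescaled time series into the polar coordinate system,
we can easily exploit the angular perspective by considering the trigonometric
sum between each pair of points to identify the temporal correlation in
different time intervals. The GAF is defined as follows:

\[ G = \begin{bmatrix} \cos(\phi_1 + \phi_1) & \cdots & \cos(\phi_1 + \phi_n) \\ \cos(\phi_2 + \phi_1) & \cdots & \cos(\phi_2 + \phi_n) \\ \vdots & \ddots & \vdots \\ \cos(\phi_n + \phi_1) & \cdots & \cos(\phi_n + \phi_n) \end{bmatrix} \tag{4} \]

\[ G = \tilde{X}' \cdot \tilde{X} - \sqrt{I - \tilde{X}^2}\,' \cdot \sqrt{I - \tilde{X}^2} \tag{5} \]

$I$ is the unit row vector $[1, 1, \ldots, 1]$. After transforming to the polar
coordinate system, we take the data in a time series as a 1-D metric space. By
defining the inner product
$\langle x, y \rangle = x \cdot y - \sqrt{1 - x^2} \cdot \sqrt{1 - y^2}$,
$G$ is a Gramian matrix:

\[ G = \begin{bmatrix} \langle \tilde{x}_1, \tilde{x}_1 \rangle & \cdots & \langle \tilde{x}_1, \tilde{x}_n \rangle \\ \langle \tilde{x}_2, \tilde{x}_1 \rangle & \cdots & \langle \tilde{x}_2, \tilde{x}_n \rangle \\ \vdots & \ddots & \vdots \\ \langle \tilde{x}_n, \tilde{x}_1 \rangle & \cdots & \langle \tilde{x}_n, \tilde{x}_n \rangle \end{bmatrix} \tag{6} \]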

GAF has several advantages. It provides a way to preserve temporal dependency.
When time increases, the position moves from top-left to bottom-right in the
Gramian matrix. The GAF contains temporal correlations, as $G_{i,j}$ with
$|i - j| = k$ represents the relative correlation by superposition of
directions with respect to time interval $k$. The main diagonal $G_{i,i}$ is
the special case when $k = 0$, which contains the original value/angular
information. With the main diagonal, we will approximately reconstruct the time
series from the high-level features learned by the deep neural network. The GAF
images may be large because the size of the Gramian matrix is $n \times n$ when
the length of the raw time series is $n$. To reduce the size of the GAF images,
we apply Piecewise Aggregate Approximation (PAA) [32] to smooth the time series
while keeping the overall trends. The full procedure for generating the GAF is
illustrated in Figure 1.
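As a small illustration of the GAF definition, the sketch below builds the matrix in two ways: directly from the angles, and from the rescaled values via the inner product $x \cdot y - \sqrt{1 - x^2} \cdot \sqrt{1 - y^2}$. The function names are hypothetical, and the input is assumed to be already rescaled to [-1, 1].

```python
import math

def gramian_angular_field(scaled):
    """G[i][j] = cos(phi_i + phi_j), with phi = arccos of the rescaled value."""
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]

def gaf_from_values(scaled):
    """Same matrix without angles: x_i * x_j - sqrt(1 - x_i^2) * sqrt(1 - x_j^2)."""
    s = [math.sqrt(max(0.0, 1.0 - x * x)) for x in scaled]
    return [[xi * xj - si * sj for xj, sj in zip(scaled, s)]
            for xi, si in zip(scaled, s)]

scaled = [-1.0, 0.0, 1.0, -0.5]
G = gramian_angular_field(scaled)
G_alt = gaf_from_values(scaled)
```

The two constructions agree because cos(a + b) = cos a cos b - sin a sin b and sin(arccos x) = sqrt(1 - x^2) on [0, π]. The main diagonal equals cos(2φ_i) = 2x_i^2 - 1, which is why it retains the value information up to sign.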

Through the polar coordinate system, GAFs represent the mutual correlations
between each pair of points/phases by the superposition of the nonlinear cosine
functions. Different types of time series always have their specific patterns
embedded along the time and frequency dimensions. After the feature
reformulation process by GAF, most different patterns are enhanced even for
visual inspection by humans (Figure 2).


3.2. Markov Transition Field 

We propose a framework that is similar to [31] for encoding dynamical
transition statistics. We develop that idea by representing the Markov
transition probabilities sequentially to preserve information in the temporal
dimension.

Figure 1: Illustration of the proposed encoding map of the Gramian Angular Field. X is a
sequence of typical time series in the 'SwedishLeaf' dataset. After X is rescaled by Equation
(1) or (2) and optionally smoothed by PAA, we transform it to a polar coordinate system by
Equation (3) and finally calculate its GAF image with Equation (4). In this example, we build
the GAF without PAA smoothing, so the GAF has a high resolution of 128 x 128.

Given a time series $X$, we identify its $Q$ quantile bins and assign each
$x_i$ to its corresponding bin $q_j$ ($j \in [1, Q]$). Thus we construct a
$Q \times Q$ weighted adjacency matrix $W$ by counting transitions among
quantile bins in the manner of a first-order Markov chain along the time axis.
$w_{i,j}$ is the frequency with which a point in quantile $q_j$ is followed by
a point in quantile $q_i$. After normalization by $\sum_j w_{ij} = 1$, $W$ is
the Markov transition matrix:

\[ W = \begin{bmatrix} w_{11|P(x_t \in q_1 | x_{t-1} \in q_1)} & \cdots & w_{1Q|P(x_t \in q_1 | x_{t-1} \in q_Q)} \\ w_{21|P(x_t \in q_2 | x_{t-1} \in q_1)} & \cdots & w_{2Q|P(x_t \in q_2 | x_{t-1} \in q_Q)} \\ \vdots & \ddots & \vdots \\ w_{Q1|P(x_t \in q_Q | x_{t-1} \in q_1)} & \cdots & w_{QQ|P(x_t \in q_Q | x_{t-1} \in q_Q)} \end{bmatrix} \tag{7} \]

It is insensitive to the distribution of $X$ and the temporal dependency on
the time steps $t_i$. However, getting rid of the temporal dependency results
in too much information loss in the matrix $W$. To overcome this drawback, we
define the Markov Transition Field (MTF) as follows:

\[ M = \begin{bmatrix} w_{ij | x_1 \in q_i, x_1 \in q_j} & \cdots & w_{ij | x_1 \in q_i, x_n \in q_j} \\ w_{ij | x_2 \in q_i, x_1 \in q_j} & \cdots & w_{ij | x_2 \in q_i, x_n \in q_j} \\ \vdots & \ddots & \vdots \\ w_{ij | x_n \in q_i, x_1 \in q_j} & \cdots & w_{ij | x_n \in q_i, x_n \in q_j} \end{bmatrix} \tag{8} \]

Figure 2: Examples of GAF images on the 'Coffee', 'Gun-Point', 'Adiac' and '50Words'
datasets.

We build a $Q \times Q$ Markov transition matrix $W$ by dividing the data
(magnitude) into $Q$ quantile bins. The quantile bins that contain the data at
time steps $i$ and $j$ (temporal axis) are $q_i$ and $q_j$ ($q \in [1, Q]$).
$M_{ij}$ in the MTF denotes the transition probability of $q_i \to q_j$. That
is, we spread out matrix $W$, which contains the transition probability on the
magnitude axis, into the MTF matrix by considering temporal positions.
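The construction of W and M can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, quantile edges are taken from the sorted sample, and `W[a][b]` is read as the estimated probability of moving from bin a to bin b with each row normalized to sum to one (the reading consistent with $\sum_j w_{ij} = 1$ and with $M_{ij}$ as the probability of $q_i \to q_j$).

```python
from bisect import bisect_right

def quantile_bins(series, n_bins):
    """Assign each value to one of n_bins quantile bins (0-based indices)."""
    ordered = sorted(series)
    n = len(ordered)
    # Interior bin edges taken from the sorted sample (an assumed convention).
    edges = [ordered[(k * n) // n_bins] for k in range(1, n_bins)]
    return [bisect_right(edges, x) for x in series]

def transition_matrix(bins, n_bins):
    """First-order Markov matrix: W[a][b] ~ P(next bin = b | current bin = a)."""
    counts = [[0.0] * n_bins for _ in range(n_bins)]
    for a, b in zip(bins, bins[1:]):
        counts[a][b] += 1.0
    for row in counts:
        total = sum(row)
        if total > 0:
            row[:] = [c / total for c in row]
    return counts

def markov_transition_field(series, n_bins):
    """M[i][j] = W[bin of x_i][bin of x_j], spreading W along the time axes."""
    bins = quantile_bins(series, n_bins)
    W = transition_matrix(bins, n_bins)
    return [[W[bi][bj] for bj in bins] for bi in bins]

series = [1.0, 2.0, 3.0, 4.0, 3.5, 2.5, 1.5, 0.5]
bins = quantile_bins(series, 2)
W = transition_matrix(bins, 2)
M = markov_transition_field(series, 2)
```

The resulting M is an n x n image whose entry (i, j) looks up the transition probability between the bins occupied at time steps i and j, so the temporal ordering lost in W is restored.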

Figure 3: Illustration of the proposed encoding map of a Markov Transition Field. X is a
sequence of the typical time series in the 'ECG' dataset. X is first discretized into Q quantile
bins. Then we calculate its Markov Transition Matrix W and finally build its MTF M by
Equation (8). We reduce the image size from 96 x 96 to 48 x 48 by averaging the pixels in
each non-overlapping 2 x 2 patch.

By assigning the probability from the quantile at time step $i$ to the quantile
at time step $j$ at each pixel $M_{ij}$, the MTF $M$ encodes multi-step
transition probabilities of the time series. $M_{i,j}$ with $|i - j| = k$
denotes the transition probability between the points with time interval $k$.
For example, $M_{ij | j - i = 1}$ illustrates the transition process along the
time axis with a skip step. The main diagonal $M_{ii}$, which is the special
case when $k = 0$, captures the probability from each quantile to itself (the
self-transition probability) at time step $i$. To make the image size
manageable for more efficient computation, we reduce the MTF by averaging the
pixels in each non-overlapping $m \times m$ patch with the blurring kernel
$\{\frac{1}{m^2}\}_{m \times m}$. That is, we aggregate the transition
probabilities in each subsequence of length $m$ together. Figure 3 shows the
procedure to encode time series to MTF.
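The patch averaging with the $\{\frac{1}{m^2}\}_{m \times m}$ kernel amounts to a non-overlapping block mean. A minimal sketch follows; the function name is hypothetical, and dropping trailing rows or columns that do not fill a complete patch is an assumption, since the text does not say how leftovers are handled.

```python
def block_average(image, m):
    """Average each non-overlapping m x m patch of a square image."""
    size = len(image) // m  # assumption: incomplete trailing patches are dropped
    reduced = []
    for bi in range(size):
        row = []
        for bj in range(size):
            total = sum(image[bi * m + di][bj * m + dj]
                        for di in range(m) for dj in range(m))
            row.append(total / (m * m))
        reduced.append(row)
    return reduced

# A 4 x 4 toy image with entries 0..15, reduced to 2 x 2.
image = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
small = block_average(image, 2)
```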

By scattering the first-order transition probability into the temporally
ordered matrix, MTFs encode the transition dynamics between different time lags
$k$. We assume that different types of time series have their specific
transition dynamics embedded in the temporal and frequency domains. After the
feature reformulation process by MTF, most transition dynamics are extracted
and become apparent on visual inspection (Figure 4).

Figure 4: Examples of MTF images on the 'OSUleaf', 'fish', 'ECG' and 'Faceall' datasets.

4. Tiled Convolutional Neural Networks 

Tiled Convolutional Neural Networks [15] are a variation of Convolutional
Neural Networks. They use tiles and multiple feature maps to learn invariant
features. Tiles are parameterized by a tile size k to control the distance over
which weights are shared. By producing multiple feature maps, Tiled CNNs
learn overcomplete representations through unsupervised pretraining with
Topographic ICA (TICA).

A typical TICA network is a two-stage optimization procedure with square and
square-root nonlinearities in the two stages, respectively. In the first stage,
the weight matrix $W$ is learned while the matrix $V$ is hard-coded to
represent the topographic structure of units. More precisely, given a sequence
of inputs $\{x^{(t)}\}$, the activation of each unit in the second stage is

\[ f_i(x^{(t)}; W, V) = \sqrt{\sum_{k=1}^{p} V_{ik} \Big( \sum_{j=1}^{q} W_{kj} x_j^{(t)} \Big)^2}. \]

TICA learns the weight matrix $W$ in the second stage by solving:

\[ \underset{W}{\text{minimize}} \ \sum_{t=1}^{n} \sum_{i=1}^{p} f_i(x^{(t)}; W, V) \quad \text{subject to} \quad W W^T = I \tag{9} \]

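$W \in \mathbb{R}^{p \times q}$ and $V \in \mathbb{R}^{p \times p}$, where $p$
is the number of hidden units in a layer and $q$ is the size of the input. $V$
is a logical matrix ($V_{ij} = 1$ or $0$) that encodes the topographic
structure of the hidden units by a contiguous 3 x 3 block. The orthogonality
constraint $W W^T = I$ provides diversity among learned features.

Algorithm 1 Unsupervised pretraining with TICA [15]

Require: $\{x^{(t)}\}_{t=1}^{n}$, $W$, $V$, $k$, $s$ as input
Ensure: $W$ as output
repeat
    $f^{old} = \sum_{t=1}^{n} \sum_{i=1}^{p} \sqrt{\sum_{k=1}^{p} V_{ik} (\sum_{j=1}^{q} W_{kj} x_j^{(t)})^2}$, $g = \partial f^{old} / \partial W$, $f^{new} = +\infty$, $\alpha = 1$
    while $f^{new} \ge f^{old}$ do
        $W^{new} = W - \alpha g$
        $W^{new} = \text{localize}(W^{new}, s)$
        $W^{new} = \text{tieWeights}(W^{new}, k)$
        $W^{new} = \text{orthogonalizeLocalRF}(W^{new})$
        $W^{new} = \text{tieWeights}(W^{new}, k)$
        $f^{new} = \sum_{t=1}^{n} \sum_{i=1}^{p} \sqrt{\sum_{k=1}^{p} V_{ik} (\sum_{j=1}^{q} W^{new}_{kj} x_j^{(t)})^2}$
        $\alpha = 0.5\alpha$
    end while
    $W = W^{new}$
until convergence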
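To make the TICA objective of Equation (9) concrete, the sketch below evaluates the pooled activations and the summed objective for a toy configuration. The function names are hypothetical, and the tiny 2 x 2 matrices stand in for the real weight matrix and 3 x 3 topographic neighbourhoods purely for illustration.

```python
import math

def tica_activations(x, W, V):
    """Pooled activation per unit: sqrt(sum_k V[i][k] * (W[k] . x)^2)."""
    # First stage: squared linear responses of each filter.
    squared = [sum(w * xj for w, xj in zip(row, x)) ** 2 for row in W]
    # Second stage: square root of the responses pooled over the neighbourhood in V.
    return [math.sqrt(sum(v * s for v, s in zip(v_row, squared))) for v_row in V]

def tica_objective(inputs, W, V):
    """Sum of pooled activations over all inputs and units (the quantity minimized)."""
    return sum(sum(tica_activations(x, W, V)) for x in inputs)

W = [[1.0, 0.0], [0.0, 1.0]]  # toy orthogonal weights, so W W^T = I holds
V = [[1, 1], [0, 1]]          # toy pooling structure (not a 3 x 3 block)
inputs = [[3.0, 4.0], [0.0, 2.0]]
objective = tica_objective(inputs, W, V)
```

Minimizing this sum encourages sparse pooled responses, while the constraint on W keeps the filters from collapsing onto each other.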

The pretraining algorithm (Algorithm 1) is based on gradient descent on
the TICA objective function in Equation (9). The inner loop is a simple
implementation of backtracking line search. The orthogonalizeLocalRF($W^{new}$)
function only orthogonalizes the weights that have completely overlapping
receptive fields. Weight-tying is applied by averaging each set of tied
weights. The algorithm is trained by batch projected gradient descent. Other
unsupervised feature learning algorithms such as RBMs and autoencoders [33]
require more parameter tuning, especially during optimization. However,
pretraining with TICA usually requires little tuning of optimization
parameters, because the tractable objective function of TICA makes convergence
easy to monitor.
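The backtracking line search in the inner loop can be illustrated on a generic objective. This sketch keeps only the step-halving logic: the projection steps of the pretraining algorithm (localization, weight tying, local orthogonalization) are deliberately omitted, and the function name, the quadratic test objective, and the `min_alpha` safeguard are assumptions made for the example.

```python
def backtracking_step(f, grad, w, alpha=1.0, min_alpha=1e-12):
    """Take one gradient step, halving alpha until the objective decreases."""
    f_old = f(w)
    g = grad(w)
    while alpha > min_alpha:
        w_new = [wi - alpha * gi for wi, gi in zip(w, g)]
        if f(w_new) < f_old:
            return w_new, alpha  # accepted step and the step size used
        alpha *= 0.5             # step too long: shrink and retry
    return w, 0.0                # no improving step found

# Toy quadratic objective f(w) = sum(w_i^2) with gradient 2w.
f = lambda w: sum(wi * wi for wi in w)
grad = lambda w: [2.0 * wi for wi in w]
w_new, alpha_used = backtracking_step(f, grad, [1.0, -2.0])
```

With the initial step of 1 the update overshoots to a point with the same objective value, so the step is halved once and the second trial is accepted.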
[Figure 5 panel labels: Convolutional I, TICA Pooling I, Convolutional II, Pooling II, Linear SVM; feature maps l = 6]

Figure 5: Structure of the tiled convolutional neural network. We fix the size of receptive
fields to 8 x 8 in the first convolutional layer and 3 x 3 in the second convolutional layer.
Each TICA pooling layer pools over a block of 3 x 3 input units in the previous layer without
wraparound at the borders to optimize for sparsity of the pooling units. The number of
pooling units in each map is exactly the same as the number of input units. The last layer is
a linear SVM for classification. We construct this network by stacking two Tiled CNNs, each
with 6 maps (l = 6) and tiling size k = 2.

Neither GAF nor MTF images are natural images; they have no natural
concepts such as “edges” and “angles”. Thus, we propose to exploit the benefits
of unsupervised pretraining with TICA to learn many diverse features with local
orthogonality. In [15], the authors empirically demonstrate that tiled CNNs
perform well with limited labeled data because the partial weight tying requires
fewer parameters and reduces the need for a large amount of labeled data. Our
data from the UCR Time Series Repository [34] tends to have few instances
(e.g., the “yoga” dataset has 300 labeled instances in the training set and
3000 unlabeled instances in the test set), so tiled CNNs are suitable for our
learning task. Moreover, Tiled CNNs achieve good performance on large datasets
(such as NORB and CIFAR).

Typically, tiled CNNs are trained with two hyperparameters, the tiling size
k and the number of feature maps l. In our experiments, we directly fixed
the network structures without tuning these hyperparameters in loops. Our
experimental settings follow the default deep network structures and
parameters in [15]. Tiled CNNs with such configurations are reported to achieve
the best performance on the NORB image classification benchmark. Although
tuning the parameters will surely enhance performance, doing so may cloud our
understanding of the power of the representation. Another consideration is
computational efficiency. All of the experiments on the 12 datasets could be
done in one day on a laptop with an Intel i7-3630QM CPU and 8GB of memory
(our experimental platform). Thus, the results in this paper are a preliminary
lower bound on the potential best performance. Thoroughly exploring network
structures and parameters will be addressed in future work. The structure and
parameters of the tiled CNN used in this paper are illustrated in Figure 5.

Table 1: Summary statistics of 12 standard datasets

DATASET       CLASS   TRAIN   TEST    LENGTH
50words       50      450     455     270
Adiac         37      390     391     176
Beef          5       30      30      470
Coffee        2       28      28      286
ECG200        2       100     100     96
FaceAll       14      560     1,690   131
Lightning2    2       60      61      637
Lightning7    7       70      73      319
OliveOil      4       30      30      570
OSULeaf       6       200     242     427
SwedishLeaf   15      500     625     128
Yoga          2       300     3,000   426







5. Experiments on Time Series Data 


We apply Tiled CNNs to classify the GAF and MTF representations of twelve
tough datasets, on which the classification error rate is above 0.1 with
the state-of-the-art SAX-BoP approach [35, 22]. More detailed statistics are
summarized in Table 1. The datasets are pre-split into training and testing
sets for experimental comparisons. For each dataset, the table gives its name,
the number of classes, the number of training and test instances, and the
length of the individual time series.

5.1. Experiment Settings 

In our experiments, the size of the GAF image is regulated by the number
of PAA bins $S_{GAF}$. Given a time series $X$ of size $n$, we divide the time
series into $S_{GAF}$ adjacent, non-overlapping windows along the time axis and
extract the mean of each bin. This enables us to construct the smaller GAF
matrix $G_{S_{GAF} \times S_{GAF}}$. MTF requires the time series to be
discretized into $Q$ quantile bins to calculate the $Q \times Q$ Markov
transition matrix, from which we construct the raw MTF image $M_{n \times n}$
afterwards. Before classification, we shrink the MTF image size to
$S_{MTF} \times S_{MTF}$ by the blurring kernel $\{\frac{1}{m^2}\}_{m \times m}$
where $m = \lceil n / S_{MTF} \rceil$. The Tiled CNN is trained with image size
$\{S_{GAF}, S_{MTF}\} \in \{16, 24, 32, 40, 48\}$ and quantile size
$Q \in \{8, 16, 32, 64\}$. At the last layer of the Tiled CNN, we use a linear
soft margin SVM [36] and select $C$ by 5-fold cross validation over
$\{10^{-4}, 10^{-3}, \ldots, 10^{4}\}$ on the training set.
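The PAA reduction and the blur kernel size can be sketched as below. The function names are hypothetical, and the way window boundaries are chosen when the series length is not divisible by the number of bins (integer division of cumulative positions) is an assumption not specified in the text.

```python
import math

def paa(series, n_bins):
    """Piecewise Aggregate Approximation: mean of each adjacent, non-overlapping window."""
    n = len(series)
    if not 0 < n_bins <= n:
        raise ValueError("n_bins must be between 1 and len(series)")
    means = []
    for b in range(n_bins):
        start = (b * n) // n_bins       # assumed boundary rule for uneven splits
        stop = ((b + 1) * n) // n_bins
        window = series[start:stop]
        means.append(sum(window) / len(window))
    return means

def blur_kernel_size(n, s_mtf):
    """Patch side m = ceil(n / S_MTF) used to shrink an n x n MTF image."""
    return math.ceil(n / s_mtf)

smoothed = paa([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3)
m_ecg = blur_kernel_size(96, 48)
m_swedish = blur_kernel_size(128, 48)
```

For a length-96 series and a target size of 48 the kernel is 2 x 2, matching the reduction described for the MTF illustration.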

Table 2: Tiled CNN error rate on training set and test set

DATASET       GAF TRAIN   GAF TEST   MTF TRAIN   MTF TEST
50words       0.338       0.310      0.442       0.426
Adiac         0.321       0.284      0.638       0.665
Beef          0.633       0.4        0.533       0.233
Coffee        0           0          0           0
ECG200        0.16        0.11       0.15        0.21
FaceAll       0.121       0.244      0.102       0.259
Lightning2    0.2         0.18       0.167       0.361
Lightning7    0.329       0.397      0.386       0.411
OliveOil      0.2         0.2        0.033       0.3
OSULeaf       0.415       0.463      0.43        0.483
SwedishLeaf   0.134       0.104      0.206       0.176
Yoga          0.183       0.177      0.193       0.243

For each input of image size $S_{GAF}$ or $S_{MTF}$ and quantile size $Q$, we
pretrain the Tiled CNN with the full unlabeled dataset (both training and test
set with no labels) to learn the initial weights $W$ through TICA. Then we
train the SVM at the last layer by selecting the penalty factor $C$ with cross
validation. Finally, we classify the test set using the optimal hyperparameters
$\{S, Q, C\}$ with the lowest error rate on the training set. If two or more
models tie, we prefer the larger $S$ and $Q$ because larger $S$ helps preserve
more information through the PAA procedure and larger $Q$ encodes the dynamic
transition statistics with more detail. Our model selection approach provides
generalization without being overly expensive computationally.
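The selection rule of Section 5.1 (lowest training error, ties broken in favour of larger S and Q) can be written as a single sort key. This is only a sketch: the function name and the dictionary layout are hypothetical, the error values below are made up for illustration, and giving S priority over Q when both differ is an assumption, since the text does not order the two.

```python
def select_model(candidates):
    """Pick the candidate with the lowest training error; on ties prefer larger S, then larger Q."""
    return min(candidates, key=lambda c: (c["train_error"], -c["S"], -c["Q"]))

# Illustrative candidates only; these numbers are not results from the paper.
candidates = [
    {"S": 16, "Q": 8,  "C": 1.0,  "train_error": 0.20},
    {"S": 32, "Q": 16, "C": 10.0, "train_error": 0.15},
    {"S": 48, "Q": 16, "C": 0.1,  "train_error": 0.15},
    {"S": 48, "Q": 64, "C": 1.0,  "train_error": 0.15},
]
best = select_model(candidates)
```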

5.2. Results and Discussion 

We use Tiled CNNs to classify GAF and MTF representations separately
on the 12 datasets. The training and test error rates are shown in Table 2.
Generally, our approach is not prone to overfitting, as seen by the relatively
small difference between training and test set errors. One exception is the
'OliveOil' dataset with the MTF approach, where the test error is significantly
higher.

In addition to the slight risk of potential overfitting, MTF has generally
higher error rates than GAF. This is most likely because of uncertainty in the
inverse image of MTF. Note that the encoding functions from time series to GAF
and MTF are both surjections. The map functions of GAF and MTF will each
produce only one image with fixed $S$ and $Q$ for each given time series $X$.
Because they are both surjective mapping functions, the inverse image of the
map is not fixed. As shown in a later section, we can approximately reconstruct
the raw time series from GAF, but it is very hard to even roughly recover the
signal from MTF. GAF has smaller uncertainty in the inverse image of its
mapping function because randomness only comes from the ambiguity of
$\cos(\phi)$ when $\phi \in [0, 2\pi]$. MTF, on the other hand, has a much
larger inverse image space, which results in large variation when we try to
recover the signals. Although MTF encodes the transition dynamics, which are
important features of time series, such features seem not to be sufficient for
recognition/classification tasks.

Note that at each pixel, $G_{ij}$ denotes the superposition of the directions
at $t_i$ and $t_j$, while $M_{ij}$ is the transition probability from the
quantile at $t_i$ to the quantile at $t_j$. GAF encodes static information
while MTF depicts information about dynamics. From this point of view, we
consider them as two “orthogonal” channels, like different colors in the RGB
image space. Thus, we can combine GAF and MTF images of the same size (i.e.
$S_{GAF} = S_{MTF}$) to construct a double-channel image (GAF-MTF). Since
GAF-MTF combines both the static and dynamic statistics embedded in raw time
series, we posit that it will improve classification performance. In the next
experiment, we pretrained and fine-tuned the Tiled CNN on the compound GAF-MTF
images. Then, we report the classification error rate on test sets.
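Combining two equally sized images into a double-channel image is a per-pixel pairing. The sketch below shows one possible layout (a channel-last nested list); the function name and the layout are assumptions, as the text does not specify how the two channels are stored.

```python
def stack_channels(gaf, mtf):
    """Pair GAF and MTF pixels into a two-channel image of the same height and width."""
    if len(gaf) != len(mtf) or any(len(a) != len(b) for a, b in zip(gaf, mtf)):
        raise ValueError("GAF and MTF images must have the same size")
    return [[(g, m) for g, m in zip(g_row, m_row)]
            for g_row, m_row in zip(gaf, mtf)]

# Toy 2 x 2 images standing in for a GAF and an MTF of equal size.
gaf = [[1.0, -0.5], [-0.5, 0.0]]
mtf = [[0.75, 0.25], [0.5, 0.5]]
image = stack_channels(gaf, mtf)
```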

Table 3 compares the classification error rate of our approach with previously
published results of five competing methods: two state-of-the-art 1NN
classifiers based on Euclidean distance and DTW, the recently proposed
Fast-Shapelets based classifier [37], the classifier based on Bag-of-Patterns
(BoP) [35, 22] and the most recent SAX-VSM approach [38]. Our approach
outperforms 1NN-Euclidean, Fast-Shapelets, and BoP, and is competitive with
1NN-DTW and SAX-VSM.

In addition, by comparing the results between Table 2 and Table 3, we verified
our assumption that combined GAF-MTF images have better expressive power than
GAF or MTF alone for classification. GAF-MTF images match or improve on the
test error rate of the single representations on ten datasets out of twelve
(the exceptions are the 'Adiac' and 'Beef' datasets). On the 'OliveOil'
dataset, the training error rate is 6.67% and the test error rate is 16.67%.
This demonstrates that the integration of both types of images into one
compound image decreases the risk of overfitting as well as enhancing the
overall classification accuracy. Thus, the intrinsic “orthogonality” between
GAF and MTF on the same time series helps improve the classification
performance with more comprehensive features. The multi-channel encoding
approach is a scalable framework. The combination of multiple orthogonal
channels into one image potentially improves the classification results,
decreasing the risk of overfitting through a generalized ensemble framework.
Meanwhile, hand-crafted feature integration potentially helps learn different
informative features through deep learning architectures.

Table 3: Summary of the error rates from 6 recently published best results and our approach.
The symbols *, ◁, † and • represent datasets generated from figure shapes (2D), physiological
surveillance, industry and all remaining temporal signals, respectively.

Dataset         1NN-Euclidean  1NN-DTW  Fast shapelet  SAX-BoP  SAX-VSM  RPCD   GAF-MTF
50words •       0.369          0.242    0.4429         0.466    N/A      0.226  0.284
Adiac *         0.389          0.391    0.514          0.432    0.381    0.384  0.307
Beef •          0.467          0.467    0.447          0.433    0.33     0.367  0.3
Coffee •        0.25           0.18     0.067          0.036    0        0      0
ECG200 ◁        0.12           0.23     0.227          0.14     0.14     0.14   0.08
FaceAll *       0.286          0.192    0.402          0.219    0.207    0.191  0.223
Lightning2 †    0.246          0.131    0.295          0.164    0.196    0.246  0.18
Lightning7 †    0.425          0.274    0.403          0.466    0.301    0.356  0.397
OliveOil •      0.133          0.133    0.213          0.133    0.1      0.167  0.167
OSULeaf *       0.483          0.409    0.359          0.236    0.107    0.355  0.446
SwedishLeaf *   0.213          0.21     0.27           0.198    0.251    0.098  0.093
Yoga *          0.17           0.164    0.249          0.17     0.164    0.134  0.16
WINS#           0              3        0              0        3        3      5
Figure 6: (a) Original GAF and its six learned feature maps before the SVM layer in Tiled
CNN (top left), and (b) raw time series and approximate reconstructions based on the main
diagonal of six feature maps (top right) on the '50Words' dataset; (c) Original MTF and its six
learned feature maps before the SVM layer in Tiled CNN (bottom left), and (d) curve of
self-transition probability along time axis (main diagonal of MTF) and approximate
reconstructions based on the main diagonal of six feature maps (bottom right) on the
'SwedishLeaf' dataset.

5.3. Analysis of Learned Features 

In contrast to the cases in which the CNN is applied in natural image
recognition tasks, neither GAF nor MTF has a natural interpretation in terms of
visual concepts (e.g., "edges" or "angles"). In this section, we analyze the
features and weights learned through the Tiled CNNs to explain why our approach
works.

As mentioned earlier, the mapping function from time series to GAF is
surjective, and the uncertainty in its inverse image comes from the ambiguity
of $\cos(\phi)$ when $\phi \in [0, 2\pi]$. The main diagonal of GAF, i.e.
$\{G_{ii}\} = \{\cos(2\phi_i)\}$, allows us to approximately reconstruct the
original time series, ignoring the signs, by

$$\cos(\phi) = \sqrt{\frac{\cos(2\phi) + 1}{2}} \qquad (10)$$
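
The reconstruction in Equation (10) can be checked numerically. The following
is a minimal sketch, not the authors' code: it assumes the series has been
min-max rescaled to [0, 1] (so the lost sign is harmless) and that the GAF is
the summation form G_ij = cos(phi_i + phi_j), which matches the diagonal
cos(2 phi_i) given above. The function names are hypothetical.

```python
import math

def rescale01(series):
    """Min-max rescale to [0, 1] so that arccos is defined (assumed preprocessing)."""
    lo, hi = min(series), max(series)
    return [(v - lo) / (hi - lo) for v in series]

def gaf(x):
    """Summation-form Gramian Angular Field: G[i][j] = cos(phi_i + phi_j)."""
    phi = [math.acos(v) for v in x]
    return [[math.cos(a + b) for b in phi] for a in phi]

def reconstruct_from_diagonal(G):
    """Equation (10): |cos(phi_i)| = sqrt((G_ii + 1) / 2)."""
    # max(0.0, ...) guards against tiny negative values from rounding.
    return [math.sqrt(max(0.0, (G[i][i] + 1.0) / 2.0)) for i in range(len(G))]

series = [3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.0, 6.0]  # illustrative values only
x = rescale01(series)
recovered = reconstruct_from_diagonal(gaf(x))
max_error = max(abs(a - b) for a, b in zip(x, recovered))
```

With a series rescaled to [-1, 1] instead, the same routine would return only
the absolute values, which is the sign ambiguity mentioned in the text.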

MTF has much larger uncertainty in its inverse image, making it hard to
reconstruct the raw data from MTF alone. However, the diagonal
$\{M_{ij}\}_{|i-j|=k}$ represents the transition probability among the
quantiles in temporal order considering the time interval k. We construct the
self-transition probability along the time axis from the main diagonal of MTF,
as we do for GAF. Although such reconstructions capture the morphology of the
raw time series less accurately, they provide another perspective on how Tiled
CNNs capture the transition dynamics embedded in MTF.
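
The self-transition curve read off the main diagonal of the MTF can be
sketched as follows. This is an illustration under stated assumptions: points
are assigned to Q quantile bins by rank, the first-order transition matrix W
is row-normalized, and M_ij = W[q_i][q_j]. The function names and the toy
series are hypothetical.

```python
def quantile_bins(series, Q):
    """Assign each point to one of Q quantile bins (0 .. Q-1) by its rank."""
    order = sorted(range(len(series)), key=lambda i: series[i])
    bins = [0] * len(series)
    for rank, i in enumerate(order):
        bins[i] = rank * Q // len(series)
    return bins

def transition_matrix(bins, Q):
    """First-order Markov transition matrix; each visited row sums to 1."""
    counts = [[0] * Q for _ in range(Q)]
    for a, b in zip(bins, bins[1:]):
        counts[a][b] += 1
    W = []
    for row in counts:
        total = sum(row)
        W.append([c / total if total else 0.0 for c in row])
    return W

def mtf(series, Q):
    """Markov Transition Field: M[i][j] = W[q_i][q_j]."""
    bins = quantile_bins(series, Q)
    W = transition_matrix(bins, Q)
    return [[W[a][b] for b in bins] for a in bins]

def self_transition_curve(M):
    """Main diagonal of the MTF: self-transition probability along time."""
    return [M[i][i] for i in range(len(M))]

series = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1, 1.3, 1.05]  # illustrative values only
curve = self_transition_curve(mtf(series, Q=4))
```

The curve is high where the series stays within the same quantile bin and low
where it jumps between bins, which is why it tracks transition dynamics rather
than the raw shape.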

Figure 6 illustrates the reconstruction results from six feature maps learned
before the last SVM layer on the GAF and MTF. The Tiled CNN extracts the color
patch, which is essentially an adaptive moving average that enhances several
receptive fields within the nonlinear units by different trained weights. It
is not a simple moving average but a synthetic integration that considers the
2D temporal dependencies among different time intervals, a benefit of the
Gramian matrix structure that helps preserve the temporal information. By
observing the rough orthogonal reconstruction from each layer of the feature
maps, we can clearly see that the CNNs extract the multi-frequency
dependencies through the convolution and pooling architecture on the GAF and
MTF images. Different feature maps preserve the overall trend while addressing
more details in different subphases. As shown in Figures 6(b) and 6(d), the
high-level feature maps learned by the Tiled CNN are equivalent to a
multi-frequency approximator of the original curve.

Figure 7 shows the learned sparse weight matrix W with the constraint
$WW^T = I$, which makes effective use of local orthogonality. The TICA
pretraining provides the built-in advantage that the function w.r.t. the
parameter space is not likely to be ill-conditioned, as $WW^T = I$. As shown in
Figure 7 (right), the weight matrix W is quasi-orthogonal and its entries are
close to 0, without very large magnitudes. This implies that the condition
number of W approaches 1, which helps the system to be well-conditioned.

Figure 7: Learned sparse weights W for the last SVM layer in Tiled CNN (left)
and its orthogonality constraint $WW^T = I$ (right).
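
The conditioning claim can be made explicit with a short derivation. This is a
sketch that assumes the condition number meant here is the 2-norm condition
number, i.e. the ratio of the largest to the smallest singular value of W. The
perturbation E and its size epsilon are symbols introduced only for this
illustration.

```latex
% Exact orthogonality: every singular value of W equals 1.
WW^{T} = I
\;\Rightarrow\; \sigma_i(W) = \sqrt{\lambda_i(WW^{T})} = 1 \ \text{for all } i
\;\Rightarrow\; \kappa_2(W) = \frac{\sigma_{\max}(W)}{\sigma_{\min}(W)} = 1 .

% Quasi-orthogonality: WW^T = I + E with E symmetric and \|E\|_2 = \varepsilon < 1.
1 - \varepsilon \le \lambda_i(WW^{T}) \le 1 + \varepsilon
\;\Rightarrow\; \kappa_2(W) \le \sqrt{\frac{1 + \varepsilon}{1 - \varepsilon}}
\;\longrightarrow\; 1 \quad \text{as } \varepsilon \to 0 .
```

The second line follows from Weyl's inequality for symmetric matrices, so a
small deviation from orthogonality can only move the condition number slightly
away from 1.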

6. Experiments on Trajectory Data 

We have demonstrated the effectiveness of GAF and MTF on benchmark time series
datasets from domains as diverse as shape, physiological surveillance and
industry, drawn from the UCR time series repository. In this section we
describe an application of our approach to classifying spatial-temporal
trajectory data. Trajectory data is complex because patterns of movement are
often driven by unperceived goals and constrained by an unknown environment.

To compare our results with other benchmark approaches, including the seminal
work of [39], we run experiments on two benchmark datasets, the animal
movement dataset (Animal) and the hurricane track dataset (Hurricane)
(Figure 8). Both datasets have trajectories of unequal length. For the
"Animal" dataset, the x and y coordinates are extracted from animal movements
observed in June 1995. It is divided into three classes by species: elk, deer,
and cattle, as shown in Figure 8(a). The numbers of trajectories (points) are
38 (7117), 30 (4333), and 34 (3540), respectively. In the "Hurricane" dataset,
the latitude and longitude are extracted from Atlantic hurricanes for the
years 1950 through 2006. The Saffir-Simpson scale classifies hurricanes into
categories 1-5 by intensity. A higher category number indicates higher
intensity. Categories 2 and 3 are chosen as the two classes. The numbers of
trajectories (points) are 61 (2459) and 72 (3126), respectively. Both datasets
are pre-split into two parts for training (80%) and testing (20%). Figure 8
shows an overview of the trajectory data. Table 4 provides the classes,
training size, testing size, minimum length and maximum length of the
trajectory data.

Figure 8: Overview of the trajectory data and the RB-TB features learnt in
[39]: (a) animal tracking data (left; red: elk, blue: deer, black: cattle) and
(b) hurricane data (right; red: Category 2, blue: Category 3).

6.1. Hilbert Space Filling Curves 

Spatial-temporal trajectory data is commonly multi-dimensional. We use 
Hilbert Space Filling Curves (SFC) to transform the trajectory into time series 
while preserving the spatial-temporal information. 

Space filling curves have been studied by mathematicians since the late 19th
century, when the first graphical representation was proposed by David Hilbert
in 1891 [40]. Space filling curves provide a linear mapping from the
multi-dimensional space to the 1-dimensional space. This mapping can be
thought of as dividing D-dimensional space into D-dimensional hypercubes with
a line passing through each hypercube. Recently, space filling curve based
approaches have been shown to preserve the locality between objects in the
multi-dimensional space when they are mapped to the linear space, and thus
have been applied to different tasks like clustering [41], high dimensional
outlier detection [42], trajectory query [43] and classification [44].
Figure 9(a) shows SFC examples of order {1,2,3,4,5,6}.

Table 4: Summary statistics of two trajectory datasets.

Dataset           Classes  Training Size  Testing Size  Min Length  Max Length
Animal Tracking   3        80             18            10          291
Hurricane         2        112            21            11          108

Figure 9: (a) Hilbert space filling curves of order {1,2,3,4,5,6} in
2-dimensional space (left). (b) An example of the transformation from a
2-dimensional trajectory to a 1-dimensional time series using an HSFC of
order 2 (right).

Basically, the SFC of order 1 divides the square into 4 sub-areas. For the
Hilbert curve of order 2, each sub-area of the order-1 curve is further
divided into 4 sub-areas. This process goes on as the order of the SFC
increases. It is clear that the number of sub-areas in a 2-dimensional SFC is
$2^{2 \times \mathrm{order}}$. To convert 2-dimensional data points to
1-dimensional points, each sub-area is numbered with an integer from 0 to
$2^{2 \times \mathrm{order}} - 1$, starting from the lower left corner as 0
and ending at the lower right corner. All other sub-areas are numbered in
order of occurrence of the corresponding vertex, as shown in Figure 9(b) for
order = 2. The figure also shows an example of the transformation from a 2D
trajectory to a sequence of scalars (time series). The final time series
generated after the SFC transformation is
T = [0, 3, 2, 2, 2, 7, 7, 8, 11, 13, 13, 2, 1, 1].
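
The conversion described above can be sketched in code. The snippet uses the
widely known iterative coordinate-to-index routine for the Hilbert curve,
which numbers cells from the lower left corner (0) to the lower right corner.
It is not taken from the paper, and its orientation inside the square may
differ from the one drawn in the figure. The function names, the `bounds`
argument and the min-max mapping of coordinates onto the grid are assumptions
made for illustration.

```python
def hilbert_index(order, x, y):
    """Index of grid cell (x, y) along a 2D Hilbert curve of the given order.

    The grid has 2**order cells per side, so indices run from 0 to
    2**(2*order) - 1.
    """
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def trajectory_to_series(points, order, bounds):
    """Map 2D points to Hilbert indices, kept in their recorded time order.

    bounds = (xmin, xmax, ymin, ymax) is a hypothetical bounding box; using one
    box for the whole dataset keeps indices comparable across trajectories.
    """
    n = 1 << order
    xmin, xmax, ymin, ymax = bounds

    def cell(v, lo, hi):
        if hi == lo:
            return 0
        return min(n - 1, int((v - lo) / (hi - lo) * n))

    return [hilbert_index(order, cell(px, xmin, xmax), cell(py, ymin, ymax))
            for px, py in points]

# Illustrative trajectory visiting the four corners of the unit square.
corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
series_order1 = trajectory_to_series(corners, order=1, bounds=(0.0, 1.0, 0.0, 1.0))
```

Raising `order` refines the grid, which is the granularity trade-off controlled
by the SFC order hyperparameter.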

We map the trajectory points, following the visiting order of the SFC embedded
in the trajectory manifold space, to an index sequence ordered by the recorded
times. The resulting time series can be used for classification with our
algorithm. This adds another hyperparameter, the SFC order, which decides the
granularity of the space filling curve.

6.2. Experiment Settings 

The parameter settings are the same as in the previous experiments on the UCR
datasets (Section 5). The optimal SFC order is selected together with the
other parameters through 5-fold cross validation from {3,4,5,6,7,8,9,10}.

Note that both trajectory datasets have quite small sample sizes with varying
lengths. When the trajectory length (and hence the length of the time series
produced by SFC) is smaller than the image size S, we uniformly duplicate each
point in the time series in temporal order to stretch the sequence to length
S. If the difference between the length of a time series and S is smaller than
the original time series length, the interpolation strategy changes to random
duplication instead of following the temporal order.
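
One possible reading of this stretching rule is sketched below. The paper does
not spell out the details, so the interpretation is an assumption: uniform
duplication repeats every point about S/L times, and "random duplication"
picks S - L distinct points at random and repeats each once, keeping
duplicates next to their originals. The function name and the `seed` argument
are hypothetical.

```python
import random

def stretch(series, S, seed=None):
    """Stretch a short series to length S by duplicating points."""
    L = len(series)
    if L >= S:
        return list(series)
    if S - L >= L:
        # Uniform duplication in temporal order: index i maps to floor(i * L / S).
        idx = [i * L // S for i in range(S)]
    else:
        # Fewer than L points are missing: duplicate a random subset once each.
        rng = random.Random(seed)
        extra = rng.sample(range(L), S - L)
        idx = sorted(list(range(L)) + extra)
    return [series[i] for i in idx]

uniform = stretch([1, 2, 3], 7)
randomized = stretch(list(range(10)), 15, seed=0)
```

Both branches keep every original point and preserve the temporal ordering of
the values, so the stretched series has the same shape at a coarser time scale.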

6.3. Results and Discussion

Both the 'Animal' and 'Hurricane' datasets have been used in previous research
[39, 44] to achieve state-of-the-art classification accuracy. TraClass gives
two algorithms, the trajectory-based (TB-Only) and the region-based +
trajectory-based (RB-TB) approaches, which differ in the features used for
classification on these datasets. Its authors carefully designed a hierarchy
of features by partitioning trajectories and exploring two types of
clustering. In [44], the author used the SFC transformation to linearly map
the trajectory data to time series and classified the sequences based on
symbolic discretization with the multiple normal distribution assumption.

After transforming the 2D trajectory data to time series using SFC, we
generate the corresponding GAF and MTF images as shown in Figure 10. However,
we found significant overfitting with CNNs even when using 5-fold cross
validation. This is probably because both the sample size and the time series
length of the trajectory datasets are too small to avoid overfitting in neural
networks.

Figure 10: Examples of GAF and MTF images generated from the time series on
the 'Animal' and 'Hurricane' datasets, including (c) the MTF of 'Animal' and
(d) the MTF of 'Hurricane'. The time series are produced using SFC from the
raw 2D trajectories.


Previous work has discussed overfitting during cross validation and proposed
potential techniques to address this problem [45, 46]. Here, we applied a
simple and straightforward hyperparameter selection approach to reduce
classifier variance. For a given set of hyperparameters {S, Q, SFC order},
after cross validation with different C values of the linear SVM, we compute
the mean and standard deviation of the accuracy to get the $3\sigma$ lower
bound over all C by

$$\mathrm{score}_{3\sigma} = \mathrm{mean}(\mathrm{Accuracy}) - 3 \times \mathrm{STD}(\mathrm{Accuracy}) \qquad (11)$$

By selecting the other hyperparameters {S, Q, SFC order} with the best
statistical lower bound on the classifier performance over C, the optimal
hyperparameters have lower variance while preserving low bias. Using this
hyperparameter selection approach, the classification results are reported in
Table 5.
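
The selection rule in Equation (11) can be sketched as follows. This is a
minimal illustration, not the authors' code: the accuracies are made-up
numbers, the dictionary layout and function names are hypothetical, and the
use of the population standard deviation is an assumption, since the paper
does not say which estimator was used.

```python
from statistics import mean, pstdev

def lower_bound_score(accuracies):
    """Equation (11): mean accuracy minus three standard deviations over all C."""
    return mean(accuracies) - 3 * pstdev(accuracies)

def select_hyperparameters(cv_results):
    """Pick the (S, Q, SFC order) setting with the best 3-sigma lower bound.

    cv_results maps a hyperparameter tuple to the cross-validation accuracies
    obtained for the different C values of the linear SVM.
    """
    return max(cv_results, key=lambda k: lower_bound_score(cv_results[k]))

# Hypothetical accuracies for two settings across four C values.
cv_results = {
    (64, 16, 5): [0.70, 0.72, 0.71, 0.69],  # stable across C
    (32, 8, 4): [0.80, 0.55, 0.78, 0.60],   # higher peak, but unstable
}
best = select_hyperparameters(cv_results)
```

In this toy case the stable setting wins even though the other one reaches a
higher peak accuracy, which is the variance reduction the rule is meant to
provide.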

Table 5: Classification accuracy (%) for the TB-Only and RB-TB methods, the
multiple normal distribution based symbolic distance (NDist), and our
algorithm (GAF-MTF).

Dataset           TB-Only  RB-TB  NDist  GAF-MTF
Animal Tracking   50       83.3   83.3   72.2
Hurricane         65.4     73.1   52.3   71.42

We perform better than the TB-Only method on both datasets and almost as well
as the RB-TB method on the 'Hurricane' dataset. However, both the RB-TB and
NDist methods outperform ours on the 'Animal' dataset. As shown in Figure 8,
both region and trajectory based features are useful for classification. For
the 'Hurricane' dataset, direction based features are more useful than region
based features. Direction based features are quite easy to capture using our
approach, as the GAF is actually calculating the pairwise direction fields
between each pair of points in the trajectory data. For the 'Animal' dataset,
region is very important, as shown in Figure 8(a). Elk, deer and cattle are
almost separable using location alone, as their regions are clearly located at
the left, top right and bottom right, respectively. When transforming the
trajectory data into time series using SFC, two close regions might be mapped
to different sub-areas with different SFC indexes. When the indexes of two
close regions are also near each other, this can be handled by CNNs through
their capability to capture features that are invariant to small shifts.
However, CNNs are not good at discriminating similar images that differ by a
large shift. Thus, when the region information is encoded as large shifts of
specific patterns in the time series produced by SFC, CNNs might have
difficulty capturing the region information.

Although our approach does not overtake the other benchmark methods on both
trajectory datasets, we provide a more general framework to encode
spatial-temporal patterns for classification tasks. Instead of relying on
complicated hand-tuned features, our approach can be applied to a variety of
time series and trajectory data. When the region of the trajectory is not
significantly important or the direction feature dominates, our general
methods work quite well. On large datasets where the volume of time
series/trajectory data is big, our deep neural network based approach is
expected to benefit greatly from the large sample size in both feature
learning and classification tasks.

7. Conclusions and Future Work 

This paper proposed an off-line approach to spatially encode temporal patterns
for classification using convolutional neural networks. We created a pipeline
for converting trajectory and time series data into novel representations, GAF
and MTF images, and extracted high-level features from these using CNNs. The
features were subsequently used for classification. We demonstrated that our
approach yields competitive results when compared to state-of-the-art methods
by searching a relatively small parameter space. We found that GAF-MTF
multi-channel images are scalable to larger numbers of quasi-orthogonal
features that yield more comprehensive images. Our analysis of the high-level
features learned by the CNNs suggested that Tiled CNNs work like
multi-frequency moving averages that benefit from the 2D temporal dependency
preserved by the Gramian matrix.

Important future work will involve applying our method to massive amounts of
data and searching a more complete parameter space to solve real world
problems. We are also quite interested in how different deep learning
architectures perform on the GAF and MTF images generated from large datasets.
Another interesting future direction is to model time series through GAF and
MTF images. We aim to apply learned time series models in
regression/imputation and anomaly detection tasks. To extend our methods to
streaming data, we plan to design an online learning approach with recurrent
network structures to represent, learn and model temporal data in real time.